Analysis of OMICS data, practical 7

Transcriptomics part 1

Eszter Ari

Eötvös Loránd University, Budapest & Biological Research Centre, Szeged

March 28, 2025

Omics…

Overview of the session

Introduction

  • Definition
  • History of transcription profiling
  • RNA sequencing - NGS methods
  • Advantages and disadvantages of RNA-seq

RNA-seq data analysis

  • Quality checking, trimming
  • Read mapping
  • Read counting

Introduction

Transcriptomics

Transcriptome:

  • the entire repertoire of transcripts in a species
  • or cells, organs, individuals, populations, etc.
  • at a specific time or under a specific set of conditions…
  • represents a key link between information encoded in DNA and phenotype

Types of different RNAs:

  • mRNA, rRNA, tRNA
  • Post-transcriptional modifiers: small nuclear snRNA, small nucleolar snoRNA, …
  • RNA regulators: micro miRNA, piwi-interacting piRNA, small interfering siRNA

Transcriptomics

  • Basis: the amount of mRNA indicates the level of gene expression and it correlates with the protein level.
  • We can compare the gene expression of different cells, tissues, individuals, populations.
  • We can investigate the effects of different environments on gene expression.
  • This helps us to understand the underlying biological processes.

Historical overview of expression profiling

Low-throughput RNA profiling methods:

  • Northern blots (1977)
  • reverse-transcription PCR (RT-PCR, 1992)
  • quantitative real-time reverse transcription PCR (qRT-PCR)
  • expressed sequence tags (ESTs)

High-throughput RNA profiling methods

  • serial analysis of gene expression (SAGE) (1995)
  • expression microarrays (1999)
  • RNA-seq: massively parallel sequencing of RNA (cDNA) molecules (2008)

How does RNA-Seq work?

Simplistic overview

Advantages of RNA-seq

  • Robustness, high reproducibility
  • High sensitivity
  • “Direct” measurement of gene expression at the mRNA level → absolute(?) abundance of a transcript
  • The sequences of transcribed RNAs can be reconstructed
  • All transcripts – even “novel” ones – can be detected
  • Detecting transcript isoforms and splice junctions
    • → study alternative splicing, exact start and end sites
    • → updating genome annotation
  • Detecting polymorphisms (SNPs)
    • → study allele-specific expression
  • Can be used on species for which a full genome sequence is not available

Limitations of RNA-seq

  • RNA-seq is somewhat more costly than microarrays
    • it also requires more extensive bioinformatic analysis and more powerful computers
  • Cannot detect post-transcriptional modifications
  • Nor post-transcriptional regulation:
    • the amount of mRNA transcribed from geneX is not necessarily equal to the amount of proteinX
    • regulation: miRNA …
  • Biases: library size, fragment length, GC content, hexamer priming…
    • → normalization

Bioinformatic workflow

Workflow of RNA-seq data analysis

  1. Extract expressed RNA, sequencing → fastq file

  2. Pre-mapping quality checking, trimming (filtering)

  3. Read mapping to reference genome OR de novo assembly of transcripts

    • Post-mapping quality checking

  4. Read counting

  5. Quantitative Analyses: comparing expression levels

  6. Functional enrichment analysis: GO, pathways…
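
Steps 1 and 2 above can be illustrated with a minimal sketch that reads FASTQ records and trims low-quality bases from the 3' end. It assumes 4-line FASTQ records and Phred+33 quality encoding; the read, the function names and the threshold of 20 are hypothetical choices for illustration, and dedicated trimming tools use more refined rules.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from FASTQ text with 4 lines per record."""
    lines = text.strip().split("\n")
    for i in range(0, len(lines), 4):
        name, seq, _plus, qual = lines[i:i + 4]
        yield name[1:], seq, qual  # drop the leading '@' of the name line


def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:  # Phred+33 assumed
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: the last two bases have very low quality ('#' = Q2, '!' = Q0).
fastq_text = "@read1\nACGTACGT\n+\nIIIIII#!\n"

records = list(parse_fastq(fastq_text))
trimmed = [(name, *trim_3prime(seq, qual)) for name, seq, qual in records]
print(trimmed)
```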

2. Read mapping to reference genome

Mapping

Align reads to the genome

Specific mappers for RNA-seq

A standard DNA mapper cannot map reads that span two exons, i.e. reads that cross splice junctions.

The RNA-seq mapper

Mapping to genome vs. to transcriptome

  • Genome:
    • complete information (?)
  • Transcriptome assembly:
    • is it complete?
    • problem of multiple mappings to alternative isoforms
  • From the same individual OR reference genome of the species?
    • → the reads will not be 100% identical to the reference

Mapping the reads to the reference genome

  • Reference genome:
    • sequence in FASTA file → for mapping
    • (annotation in GTF file → for counting)
  • Mapper software: GSNAP, STAR, TopHat2, Rsubread, etc.
    • indexing the genome
    • mapping the reads to the genome → can be computationally intense
  • Filtering reads
    • both pairs mapped to the same chromosome?
  • Sorting reads
    • by read coordinates OR
    • by read names → depending on the next software in the pipeline
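
The filtering and sorting steps above can be sketched in plain Python on already-parsed alignments. This is only an illustration: the `Aln` tuple, the toy records and the helper name are hypothetical, and in practice a dedicated tool would normally do this on the BAM file.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    qname: str  # read name
    rname: str  # chromosome the read mapped to
    pos: int    # 1-based leftmost mapping position
    rnext: str  # chromosome of the mate; "=" means "same as rname" in SAM


# Hypothetical toy records, not real data.
alignments = [
    Aln("read2", "chr2", 500, "="),
    Aln("read1", "chr1", 900, "="),
    Aln("read3", "chr1", 100, "chr5"),  # mate mapped to another chromosome
    Aln("read4", "chr1", 300, "chr1"),
]


def same_chromosome(a: Aln) -> bool:
    """True if the read and its mate mapped to the same chromosome."""
    return a.rnext == "=" or a.rnext == a.rname


kept = [a for a in alignments if same_chromosome(a)]

# Sort by coordinate (chromosome names compared as plain strings here;
# real tools follow the chromosome order given in the file header).
by_coordinate = sorted(kept, key=lambda a: (a.rname, a.pos))

# Sort by read name, which keeps the two mates of a pair next to each other.
by_name = sorted(kept, key=lambda a: a.qname)

print([a.qname for a in by_coordinate])
print([a.qname for a in by_name])
```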

Different properties of RNA-Seq mappers

  • alignment algorithm
  • producing unique / non-unique mappings
  • using the canonical / non-canonical splice sites
  • using annotation of known exons OR not
  • programming language, platform, interface
  • speed, memory requirement, multi-threading possibility

Be aware: errors may occur!

Difficulties can occur during mapping

  • pseudogenes: reads may be mapped to a sequence that is not actually expressed
  • repetitive regions: the reads were mapped to multiple locations
  • reads mapped to intronic and intergenic regions → how should we treat them?
  • identification and quantification of alternative transcripts
  • distinguishing between (allele-specific) SNPs and sequencing errors

SAM/BAM file

Col  Field  Type    Brief description
1    QNAME  String  Query template NAME
2    FLAG   Int     Bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Reference name of the mate/next read
8    PNEXT  Int     Position of the mate/next read
9    TLEN   Int     Observed Template LENgth
10   SEQ    String  Segment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
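
As an illustration of the table above, the sketch below splits one tab-separated SAM alignment line into the 11 mandatory fields and decodes the QUAL string into Phred scores. The example line and the function names are hypothetical; in practice a library for reading SAM/BAM files would normally be used instead.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Return the 11 mandatory SAM fields as a dict; optional tags go to 'TAGS'."""
    cols = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[11:]  # optional fields such as NH:i:1, if present
    return record


def phred_scores(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in qual]


# Hypothetical alignment line, made up for illustration.
example = "read1\t99\tchr1\t1000\t60\t5M\t=\t1200\t205\tACGTA\tIIIII"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])
print(phred_scores(rec["QUAL"]))
```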

FLAG

Bit  Decimal  Description of read
1    1        Read paired
2    2        Read mapped in proper pair
3    4        Read unmapped
4    8        Mate unmapped
5    16       Read reverse strand
6    32       Mate reverse strand
7    64       First in pair
8    128      Second in pair
9    256      Not primary alignment
10   512      Read fails platform/vendor quality checks
11   1024     Read is PCR or optical duplicate
12   2048     Supplementary alignment
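
The FLAG value is the sum of the decimal values of all bits that are set, so it can be decoded with bitwise AND. A minimal sketch, using the descriptions from the table above (the function name is a hypothetical choice):

```python
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails platform/vendor quality checks",
    1024: "read is PCR or optical duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag):
    """List the descriptions of all bits that are set in a SAM FLAG value."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
```

For example, a FLAG of 99 describes the first read of a properly mapped pair whose mate is on the reverse strand.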

CIGAR (Concise Idiosyncratic Gapped Alignment Report) codes

  • A compressed representation of an alignment
  • Made up of <integer><op> pairs, e.g. 76H130M
    • Here, “op” is an operation specified as a single character, usually an upper-case letter, codes:
      • alignment match (M): the aligned bases may be identical or different
      • sequence match (=), sequence mismatch (X)
      • insertion (I), deletion (D)
      • soft clipping (S), hard clipping (H)
    • The “integer” specifies a number of consecutive operations.
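
A CIGAR string can be split into its integer and operation pairs with a regular expression, as in the sketch below. The function names are hypothetical. The sketch also accepts N (skipped region) and P (padding), which are part of the SAM specification although they are not listed above; spliced aligners use N for introns, so it matters for RNA-seq reads. The second CIGAR string is a made-up example.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the reference sequence.
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Reject strings that contain anything besides valid pairs (e.g. "*").
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment, introns included."""
    return sum(length for length, op in parse_cigar(cigar) if op in REF_CONSUMING)


print(parse_cigar("76H130M"))
# A hypothetical spliced read: 50 aligned bases, a 1000-base intron, 26 aligned bases.
print(reference_span("50M1000N26M"))
```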

Post-mapping QC

  • % of uniquely mapped reads (unambiguous)
  • % of reads mapped to multiple locations (ambiguous)
    • What to do with ambiguous reads?
  • % of soft or hard clipped reads
    • How to treat them?
  • % of unmapped reads: usually higher than when mapping genomic DNA reads, because reads that span splice junctions are harder to align
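
The percentages above could be computed roughly as follows. This sketch assumes that each alignment record has already been reduced to a (FLAG, NH) pair, where NH is the optional "number of reported alignments" tag. Not every mapper writes this tag, so treating NH = 1 as "uniquely mapped" is an assumption here, and the records are made up.

```python
from collections import Counter


def mapping_summary(records):
    """Return % unique, multi-mapped and unmapped reads from (flag, nh) pairs."""
    counts = Counter()
    for flag, nh in records:
        if flag & 256 or flag & 2048:
            continue  # skip secondary/supplementary lines: count each read once
        if flag & 4:
            counts["unmapped"] += 1
        elif nh == 1:
            counts["unique"] += 1
        else:
            counts["multi"] += 1
    total = sum(counts.values())
    keys = ("unique", "multi", "unmapped")
    if total == 0:
        return {key: 0.0 for key in keys}
    return {key: 100.0 * counts[key] / total for key in keys}


# Hypothetical records: 7 unique reads, 2 multi-mapped reads (with their
# secondary alignments) and 1 unmapped read.
records = [(0, 1)] * 7 + [(0, 3), (256, 3), (256, 3), (16, 2), (256, 2), (4, 0)]
summary = mapping_summary(records)
print(summary)
```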

3. Read counting

Read counting

Find reads that map to coding sequence

  • count reads (or read pairs) per gene, exon or transcript
  • → count table
  • CPM, RPKM, FPKM
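
CPM (counts per million) scales raw counts by library size; RPKM (reads per kilobase per million mapped reads) additionally divides by the feature length in kilobases. FPKM is the same calculation with fragments (read pairs) instead of reads. A minimal sketch with made-up counts and gene lengths, using the sum of the counted reads as the library size:

```python
# Hypothetical raw counts and gene lengths (in base pairs).
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}


def cpm(counts):
    """Counts per million: count * 1e6 / total counted reads."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: count * 1e9 / (length in bp * total)."""
    total = sum(counts.values())
    return {gene: count * 1e9 / (lengths_bp[gene] * total)
            for gene, count in counts.items()}


cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths_bp)
print(cpm_values)
print(rpkm_values)
```

In this toy example geneB has three times as many reads as geneA but is also three times as long, so both get the same RPKM.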

Genome annotation: GTF (GFF, SAF, …) file:

  • contains the location of exons, genes, other transcripts
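
A GTF line has nine tab-separated columns: sequence name, source, feature type, start, end, score, strand, frame and attributes. The sketch below parses one such line; the line itself and the function name are hypothetical. Coordinates in GTF are 1-based and include both the start and the end position.

```python
import re

# Attributes look like: gene_id "geneA"; transcript_id "geneA.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Parse one GTF feature line into a dict with integer coordinates."""
    cols = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = cols
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTR_RE.findall(attributes)),
    }


# Hypothetical annotation line, made up for illustration.
line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
exon = parse_gtf_line(line)
exon_length = exon["end"] - exon["start"] + 1  # both ends are included
print(exon["attributes"]["gene_id"], exon_length)
```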

Read counting

Complexity of transcription

We have to decide before counting

  • Exclude or include reads or read pairs that map only partially to exons?
    • What does “partially” mean? What % of the read must overlap the exon?
  • Exclude or include reads or read pairs that map to introns?
    • depending on the reliability of the genome annotation
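
One way to make “partially” concrete is to compute the fraction of the read that overlaps the exon and compare it with a threshold. The sketch below uses 1-based inclusive coordinates; the 50% threshold, the coordinates and the function names are arbitrary assumptions for illustration, not a recommendation.

```python
def overlap_fraction(read_start, read_end, exon_start, exon_end):
    """Fraction of the read's bases that lie within the exon."""
    overlap = min(read_end, exon_end) - max(read_start, exon_start) + 1
    read_length = read_end - read_start + 1
    return max(overlap, 0) / read_length


def count_read(read, exon, min_fraction=0.5):
    """Decide whether a read is counted for the exon (threshold is an assumption)."""
    return overlap_fraction(*read, *exon) >= min_fraction


exon = (1000, 1200)            # hypothetical exon
inside = (1101, 1200)          # read fully inside the exon
partial = (1181, 1280)         # 20 of 100 bases overlap the exon
outside = (1300, 1399)         # read entirely in the intron

print(overlap_fraction(*partial, *exon))
print(count_read(inside, exon), count_read(partial, exon), count_read(outside, exon))
```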

Count table with raw counts